Overview¶

Our research question asks: Between 2001 and 2021, have state-level overdose prevention policies (such as Naloxone access laws and public treatment services) been associated with reductions in statewide opioid fatality rates?

The EDA is conducted across three datasets: the dependent variable of statewide overdose death rates (this file) and the predictor variables of naloxone access policies and treatment services.

Dataset 1: Opioid Overdose Rates Across States (Dependent Variable)¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
import plotly.io as pio
In [3]:
kff_path = 'data/overdose_data.csv'

df = pd.read_csv(kff_path)

# keep the first 52 rows (national total, 50 states, and DC);
# the remaining rows are footnotes rather than data
df = df.head(52)

Description of Data¶

This data was generated by KFF, a non-partisan (albeit slightly Republican-leaning) organization focused on health policy research and polling. The collected data dates back to 1999.

This data was collected and compiled to track opioid use and the mortality impact of opioids across American states, with the intent that it be used for state policy, medical policy, and informed journalism. It is directly applicable to our research question: tracking deaths by state and year should reflect the efficacy of the harm-reduction policies implemented contemporaneously to prevent opioid-related deaths, which is one of the central goals of our research question.

A strength of this data is that it is clean and straightforward, with only three columns spanning roughly two decades: the number of opioid-related deaths per 100,000 residents, the state, and the year. The deaths-per-100k figure is our main dependent variable, with state and year as the only other features.

A potential weakness is that several states have null values in the earlier years of recording, which may degrade the accuracy of our assessment models.

This data was collected through the Vital Statistics Cooperative Program and was provided by the 57 vital statistics jurisdictions. Data was also sourced from the publicly available Multiple Cause of Death Files by the CDC, which are strongly regulated by federal law to protect privacy and prevent misuse. Thus, there is no reason to suspect any divergence from ethical data collection standards.

Data Cleaning¶

In [4]:
"""
Cell Description: 
The dataset that's read in is in a format similar to that of a pivot table.
This cell deconstructs this pivot table into the features below:
'Location', 'Year', 'Deaths_Per_100k' 
"""
result = pd.melt(df, 
            id_vars='Location', 
            value_vars=list(df.columns[1:]), # the year columns
            var_name='Year', 
            value_name='Deaths_Per_100k')
result
Out[4]:
Location Year Deaths_Per_100k
0 United States 1999 2.9
1 Alabama 1999 0.8
2 Alaska 1999 4
3 Arizona 1999 4.7
4 Arkansas 1999 1.1
... ... ... ...
1191 Virginia 2021 26.0
1192 Washington 2021 20.5
1193 West Virginia 2021 77.2
1194 Wisconsin 2021 25.9
1195 Wyoming 2021 12.4

1196 rows × 3 columns
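As a minimal illustration of what `pd.melt` does here (toy numbers, not the actual KFF file), each year column of the wide table becomes a row in the long table:

```python
import pandas as pd

# Toy wide-format table shaped like the raw KFF file
wide = pd.DataFrame({'Location': ['Alabama', 'Alaska'],
                     '1999': [0.8, 4.0],
                     '2000': [1.0, 3.5]})

long = pd.melt(wide, id_vars='Location',
               var_name='Year', value_name='Deaths_Per_100k')
print(long)  # 4 rows: one per (state, year) pair
```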

In [5]:
"""
Cell Description: 
We replace data-specific labels with np.nan's.
We divide the dataset into one with and one without the aggregated 'US' Location.
We get the correct time frame for our data.
We sort for ffill later on.
"""
result['Deaths_Per_100k'] = result['Deaths_Per_100k'].replace(['NR', 'NSD'], np.nan)
result['Year'] = result['Year'].astype(int)
result['Deaths_Per_100k'] = result['Deaths_Per_100k'].astype(float)

result_with_us = result.copy()

result = result[result['Location'] != 'United States']
result = result[result['Year'] >= 2001].copy()  # copy to avoid SettingWithCopyWarning
result.sort_values(['Location', 'Year'], inplace=True)

Handling Null Values¶

Fortunately, there is a low percentage of null values across the dataset. North Dakota had the highest proportion of nulls, and most of the null values fall in the earlier years of recording. As such, we decided to use forward fill to impute missing values from the nearest available data point, which is reasonable because death rates increase over time but tended to be relatively stable in earlier years.
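Filling within each state separately keeps a value from ever being carried across a state boundary; a toy sketch of the idea (synthetic numbers, hypothetical states `A` and `B`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2001, 2002, 2003] * 2,
    'Deaths_Per_100k': [np.nan, 2.0, 3.0, 5.0, np.nan, 7.0],
})

# Forward fill within each state, then backward fill to catch
# leading gaps, so state A's 2001 null is filled from A, not B
filled = toy.groupby('Location')['Deaths_Per_100k'] \
            .transform(lambda s: s.ffill().bfill())
print(filled.tolist())  # [2.0, 2.0, 3.0, 5.0, 5.0, 7.0]
```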

In [6]:
print('Total Number of Nulls: ', result.isna().sum().sum())
print('Proportion of Nulls: ', result.isna().sum().sum()/result.shape[0])
Total Number of Nulls:  29
Proportion of Nulls:  0.02707749766573296
In [7]:
"""
Cell Description: 
We plot the number of nulls in our target variable per state.
"""
null_states = result.groupby('Location').agg(lambda x: x.isna().sum())['Deaths_Per_100k']
null_states[null_states>0].sort_values().plot.bar(title='Total Null Values per State', ylabel='count nulls')
Out[7]:
<AxesSubplot:title={'center':'Total Null Values per State'}, xlabel='Location', ylabel='count nulls'>
In [8]:
"""
Cell Description: 
We plot the number of nulls per year.
"""
null_year = result.groupby('Year').agg(lambda x: x.isna().sum())['Deaths_Per_100k']
null_year[null_year>0].plot.bar(title='Total Null Values Per Year', ylabel='count null')
Out[8]:
<AxesSubplot:title={'center':'Total Null Values Per Year'}, xlabel='Year', ylabel='count null'>
In [9]:
"""
Cell Description: 
We use forward fill to fill in null values in our target variable.
"""
result.loc[:,'Deaths_Per_100k'] = result.loc[:,'Deaths_Per_100k'].ffill()
print('Total Number of Nulls: ', result.isna().sum().sum())
Total Number of Nulls:  0

Visualizations¶

Summary Stats

In [10]:
result.agg({"Deaths_Per_100k": ["min", "max", "median", "skew"],})
Out[10]:
Deaths_Per_100k
min 0.700000
max 77.200000
median 7.800000
skew 2.041374
In [11]:
result['Deaths_Per_100k'].describe()
Out[11]:
count    1071.000000
mean       10.996639
std         9.359158
min         0.700000
25%         4.950000
50%         7.800000
75%        13.550000
max        77.200000
Name: Deaths_Per_100k, dtype: float64
In [12]:
result.shape
Out[12]:
(1071, 3)

From the histogram below and the summary stats, we see that the data is heavily right-skewed: the median is 7.8 deaths per 100k people while the max is 77.2, resulting in a high standard deviation of 9.36. The max is far above the 75th percentile of 13.55.

In [13]:
deathshist = result['Deaths_Per_100k'].hist()
plt.title("Histogram of Deaths_Per_100k")
plt.ylabel("Count")
plt.xlabel("Number of Deaths")
Out[13]:
Text(0.5, 0, 'Number of Deaths')
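Given this right skew, a log transform is a common way to make the distribution more symmetric before modeling; a minimal sketch on synthetic right-skewed rates (not the KFF data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal draws mimic a heavily right-skewed, positive rate variable
rates = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=1000))

log_rates = np.log(rates)  # safe: all rates are strictly positive

print('raw skew:', round(rates.skew(), 2))
print('log skew:', round(log_rates.skew(), 2))  # much closer to 0
```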
In [14]:
plt.violinplot(result['Deaths_Per_100k'])
plt.title("Violin Plot of Deaths_Per_100k")
plt.xlabel("Density")
plt.ylabel("Count")
Out[14]:
Text(0, 0.5, 'Count')

Between 2001 and 2021, the nationwide overdose rate grew roughly exponentially, increasing by about 150% between 2015 and 2021 alone.

In [24]:
nat_deaths = result_with_us[result_with_us['Location']=='United States']

fig = px.line(nat_deaths, x="Year", y="Deaths_Per_100k", title='Nationwide Deaths Per 100k')
fig.show(renderer='notebook')
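One way to make the "exponential" characterization concrete is a log-linear fit: regress the log of the rate on the year, and the slope gives the implied annual growth rate. A sketch on a hand-built series with exact 10% annual growth (not the KFF series):

```python
import numpy as np

years = np.arange(2001, 2022)
rates = 3.0 * 1.10 ** (years - 2001)  # synthetic series, +10% per year

# Fit log(rate) = slope * year + intercept
slope, intercept = np.polyfit(years, np.log(rates), deg=1)
annual_growth = np.exp(slope) - 1
print(f'implied annual growth: {annual_growth:.1%}')
```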

Within individual states, death rates also increased rapidly in recent years. Kentucky, Tennessee, Maine, and Delaware experienced higher overdose rates at around 44 deaths per 100k, while West Virginia surpasses the rest at 77 deaths per 100k.

In [16]:
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
}
In [17]:
#create df for choropleth plotting, with state names replaced with proper codes
cp_df = result.copy()
cp_df.replace({'Location':us_state_to_abbrev}, inplace=True)
cp_df
Out[17]:
Location Year Deaths_Per_100k
105 AL 2001 1.3
157 AL 2002 1.6
209 AL 2003 1.1
261 AL 2004 1.8
313 AL 2005 1.8
... ... ... ...
987 WY 2017 8.7
1039 WY 2018 6.8
1091 WY 2019 8.3
1143 WY 2020 10.6
1195 WY 2021 12.4

1071 rows × 3 columns

In [18]:
vmin = cp_df['Deaths_Per_100k'].min()  # avoid shadowing the built-in min/max
vmax = cp_df['Deaths_Per_100k'].max()

fig = px.choropleth(cp_df, 
              locations = 'Location',
              color= 'Deaths_Per_100k', 
              animation_frame="Year",
              color_continuous_scale="Sunset",
              locationmode='USA-states',
              scope="usa",
              range_color=(vmin, vmax),
              title='Deaths Per 100k Over Time',
              height=600
             )
fig.show(renderer='notebook')

Change in Deaths per State Over Time

In [19]:
"""
Cell Description: 
Creating Pivot Table so we can create a pct_change dataframe.
"""
result_pivot = pd.pivot_table(result, 
                              values='Deaths_Per_100k', 
                              index='Year',
                              columns='Location', 
                              aggfunc='sum')
result_pivot = result_pivot.astype(float)
In [20]:
result_pivot.isna().sum().sum()
Out[20]:
0
In [21]:
"""
Cell Description: 
Creating pct_change() table
"""
result_pivot_chg = result_pivot.pct_change()
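`pct_change()` computes the year-over-year relative change down each column, leaving the first year as NaN since there is no prior year; a one-column example with made-up values:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0], index=[2001, 2002, 2003])
print(s.pct_change())
# 2001 -> NaN, 2002 -> 0.20 (12/10 - 1), 2003 -> 0.25 (15/12 - 1)
```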
In [22]:
"""
Cell Description: 
Plotting pct_change() table(s)
"""
for i in range(0, result_pivot_chg.index.size, int(result_pivot_chg.index.size/5)):
  start_index, end_index = i, i + int(result_pivot_chg.index.size/5)
  print("-------------")
  print("Start Index: ", start_index)
  print("End Index: ", end_index)
  subset_pct = result_pivot_chg.iloc[:, start_index:end_index]
  subset_pct.plot(xticks=subset_pct.index, ylabel='Pct Change', figsize=(8, 8), rot=45)
  plt.show()

# result_pivot_chg.plot(xticks=result_pivot_chg.index, ylabel='Pct Change', figsize=(8, 8), rot=45)
# plt.legend(ncol=3, fontsize=5)
# plt.show()
-------------
Start Index:  0
End Index:  4
-------------
Start Index:  4
End Index:  8
-------------
Start Index:  8
End Index:  12
-------------
Start Index:  12
End Index:  16
-------------
Start Index:  16
End Index:  20
-------------
Start Index:  20
End Index:  24